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Successive whole genome duplications have recently been firmly 
established in all major eukaryote kingdoms. It is not clear, how- 
ever, how such dramatic evolutionary process has contributed 
to shape the large scale topology of protein-protein interaction 
(PPI) networks. We propose and analytically solve a generic 
model of PPI network evolution under successive whole genome 
duplications. This demonstrates that the observed scale-free de- 
gree distributions and conserved multi-protein complexes may 
have concomitantly arised from i) intrinsic exponential dynamics 
of PPI network evolution and ii) asymmetric divergence of gene 
duplicates. This requirement of asymmetric divergence is in fact 
"spontaneously" fulfilled at the level of protein-binding domains. 
In addition, domain shuffling of multi-domain proteins is shown 
to provide a powerful combinatorial source of PPI network in- 
novation, while preserving essential structures of the underlying 
single-domain interaction network. Finally, large scale features 
of PPI networks reflecting the "combinatorial logic" behind di- 
rect and indirect protein interactions are well reproduced numer- 
ically with only two adjusted parameters of clear biological sig- 
nificance. 

Gene duplication is considered the main evolutionary source of 
new protein functions (J. Although long suspected jJ 01, whole 
genome duplications have only been recently confirmedjjH000] 
through large scale comparisons of complete genomes0,|31. 

Whole genome duplications are rare evolutionary transitions fol- 
lowed by random nonfunctionalization of most gene duplicates on 
time scales of about 100MY (with large variations between genes, see 
discussion). Whole genome duplications presumably provide unique 
opportunities to evolve many new function al g enes at once through 
accretion of functional domains fTol ITU 173, 1 1 3l fT^tl from contiguous 
pseudogenes (or redundant genes) and may also promote speciation 
events by preventing genetic recombinations between close descen- 
dants with different random deletion patterns. 

Recent whole genome duplications (WGDs) within the last 
500MY (about 15% of life history) have now been firmly estab- 
lished in all major eukaryote kingdoms. For instance, there are 4 
consecutive WGDs between the seasquirt Ciona intestinalis and the 
common carp Cyprinus carpio, with most tetrapods (including mam- 
mals) in between at +2WGDs from seasquirt and — 2WGDs from 
carp and most bony fish at +3WGDs from seasquirt and — lWGDs 
from c arp (a ps eudotetraploid bony fish duplicated about 10MY 
ago) 1^1^ 1151 flal . There are also 3 consecutive WGDs in the recent 
evolution of the flowering plant Arabidopsis thaliana^ and at least 
3 consecutive WGDs for the protist Paramecium tetraurelia (Patrick 
Wincker, personal communication). Extrapolating these 500MY old 
records, one roughly expects a few tens consecutive WGDs (or equiv- 
alent "doubling events") since the origin of life. These rare but dra- 
matic evolutionary transitions must have had major consequences on 



the evolution of large biological networks, such as protein-protein 
interaction (PPI) networks. 

From a theoretical point of view, we also expect that alternat- 
ing whole genome duplications and extensive gene deletions lead 
to exponential dynamics of PPI network evolution. In the long 
time limit, this should outweigh all time-linear dynamics that have 
been assumed in PPI network evolution models under local structure 
changes m El El El Ell El El (see discussion). In fact, the in- 
trinsic exponential dynamics of genome evolution is already transpar- 
ent from the wide distribution of genome sizes Q@J and proliferation 
of repetitive elements l24ll : it is hard to imagine that the 10 4 -fold span 
in lengths of eukaryote genomes could have solely arised through 
time-linear increases (and decreases) in genome sizes.* 

Modelling PPI network evolution by whole genome duplication 

We propose a simple model of PPI network evolution focussing on 
whole genome duplication (extensions to local or partial genome du- 
plication are presented in ref l25ll and confirm the conclusions of this 
paper). Each time step n corresponds to a whole genome duplica- 
tion and leads to a complete duplication of the PPI network, whereby 
each node is duplicated ( x 2) and each interaction quadruplated ( x 4) 
as depicted on Fig. 1 . Links from the duplicated network are then kept 
with different probabilities 7, (0 < 7* < 1) reflecting symmetric or 
asymmetric divergences between protein or link copies. 

The interaction network is caracterized at each step n by its num- 
ber of nodes with k neighbours iV^™' and its total number of links 
l/™' = ^2 k>1 kN^ /2. As stochastic differences exist between 
network realizations, we study the evolution of typical networks by 
introducing a generating function averaged over all network realiza- 
tions, 

F (n \x) = J2( N l n) ) xk - 0) 

fc>0 

This use of generating functions can in fact be generalized I25I1 to 
other, possibly non local features of interest (e.g. the average con- 
nectivity of first neighbors </fc|26] is introduced below). 

In the following, we discuss a general model of PPI network evolu- 
tion through whole genome duplication with asymmetric divergence 
of duplicated genes (Figs.l&2A). We compare it, first, to an alter- 
native model with symmetric protein divergence but random link 
"complementation"(T2,|2j| (Fig.S 1), and also to direct physical inter- 
actions from Yeast PPI network data (Fig. 2B&C). We then redefine 
this initial asymmetric divergence model (Fig. 1) in terms of protein- 
binding domains (Figs. 3A&B) to account for indirect protein-protein 
interaction within multi-protein complexes (Figs. 3A&C). 

Asymmetric divergence of duplicated proteins 

The case of asymmetric divergence between duplicated genes 
corresponds to the following evolution scenario; while duplicated 
proteins are initially equivalent and experience, at first, the same 
functional constraints pal, their divergence becomes eventually 
asymmetric (2^, 13(1 13 ill (see discussion). This presumably occurs 
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FIG. 1 : Model of protein-protein interaction network evolution through whole genome duplication. Whole genome duplications are followed by asym- 
metric divergence of protein duplicates with random distribution between genome copies (e.g. 1/1' vs 2/2'): "New" duplicates are left essentially free to 
accumulate neutral mutations with the likely outcome to become nonfunctional and eventually deleted unless some "new", duplication-derived interactions are 
selected; "Old" duplicates, on the other hand, are more constrained to conserve "old" interactions already present before duplication. The duplicated network 
with quadraplated links is graphically rearranged for convenience into old and new network copies (e.g. 2 and 2' duplicated nodes are swapped here). Links 
from the duplicated network are then kept with different probabilities ■ji (0 < -fi < 1) reflecting this asymmetric divergence between protein duplicates. An 
alternative model based on symmetric divergence of protein duplicates and random link "complementation' ' 1 1 9l . l27ll is illustrated in Fig. SI and discussed in the 
text. 



once one duplicate copy has lost an essential interaction and thus 
function, which has then to be fullfilled entirely by the other dupli- 
cate. The evolution of this latter duplicate is, from then on, more 
constrained to retain "old" interactions, while the former duplicate is 
left largely free to accumulate more neutral mutations with the likely 
outcome to become nonfunctional, unless some "new", duplication- 
derived interactions are selected, Fig. 1 (new interactions arising 
from horizontal gene transfer are more characteristic of prokaryote 
evolution l32ll and neglected here 12^1 ). Note that "old" and "new" 
labels in Fig. 1 refer to the asymmetric conservation and fate of du- 
plicates after WGD (and not to specific genome copies). Functional- 
ization patterns of duplicated genes are further discussed in the sup- 
porting information. 

The recurrence relation for the generating function Q is derived 
as follows: since each node is initially duplicated, F^ n+1 \x) is the 
sum of two F^ (x) where x is first replaced by x 2 (since each node 
degree can at most double) and then substituted as x — > -yiX-\-5i where 
7» [resp. St = 1 — ji] corresponds to the probability to keep [resp. 
delete] each link emerging from each node of the duplicated graph. 
Hence, the generating function recurrence for PPI network evolution 
with asymmetric divergence of duplicated proteins yields, 



F 



(n+1) (x) = F {n) (( 7 x+5)( 7 nX+J n )) +F {n) ((-yx+S)( lo x+S )). 

(2) 

where 7, 7 n and 7 [resp. 5, 5 n and S ] stand for 7 cr oss, 7new and 
7oid [resp. Across, <5 nC w and <5 id] in Fig.l (see supporting information 
for proof details). 

The overall graph dynamics through successive global duplica- 
tions is clearly exponential as anticipated; in particular, the total num- 
ber of nodes grows as F^ n ' (1) = A-2 n , where A is the initial number 
of nodes, and the number of links scales as (I/™'} oc (27+70 +7n)™- 
We remove permanently disconnected nodes from the list of relevant 
nodes, assuming that they correspond to proteins that have in fact 
lost their function and are eventually eliminated from the genome. 
To this end, we redefine the graph size as, (N^) = Ylk>i(^k^) 



and introduce a normalized generating function p 
degree distribution, 



for the mean 
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Absolute and relative generating functions are related through, 

F (n) (x) = p {n) (x)(N {n) ) + (7V ( < n) ) . 



(3) 



(4) 



Inserting this expression J4j in recurrence gives a closed relation 
between successive jjC") ( x ), 



p (n+1) (:r) = l- (5) 
2 - ((~/x+8)( ln x+5 n )) - p<"> ((~(x+8)(~f x+5 )) 
AW ' 

where A*™' is the ratio between consecutive numbers of connected 



nodes, A (n) = (7V (n+1) ) / '(iV (n) ) 



.An) 



{SS n )-p (n) (S6 )<2. 



The evolution of the mean degree is obtained by taking the first 
derivative of |5} at x = 1: 



d x p [n+L, (l) = 



-r c 



(6) 



where F n = 7+7 n and r o = 7+70 hereafter. 

We will limit the discussion here to degree distributions approach- 
ing a stationary regimes p^ n \x) — > p(x) with & finite mean degree 
1 < p'(l) < 00. This seems to cover the most biologically relevant 
networks; for completeness, other cases are discussed elsewhere (2^1. 
From {6j and the condition of finite mean degree, we readily obtain 
that A* n ^ — > r n + r o , which implies that the network evolution is 
asymptotically equivalent in terms of connected nodes and links, f 



(N {n+1) )/(N {n) ) -> (L {n+1) )/(L {n) ) = r n +r G < 2, 



(7) 



The stationary degree distribution is then solution of the functional 
equation, 



p(x) 



2 - p({'yx+S)(-y n x+8 n )) - p((-yx+S)('y x+S )) 



r n +r 



(8) 

which can be differentiated k times to express the fcth derivative in 
terms of lower derivatives, 



r k + r k 

A n "T" 

r„ + r 



= J2 «™(7n,7o,7)d;>(l) (9) 



m=[fc/2] 



where the coefficients a m are all positive from the definition J3J. 

The finite or infinite nature of d k p(l) depends on the two 
parameters T n and r o and defines the form of the limit degree 
distribution. The phase diagram Fig. 2A summarizes in the plane 
(r o + r n , r o — r n ) the different regimes for the asymptotic degree 
distribution pt- T + T n is the global growth rate of the network 
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FIG. 2: Analytical and numerical results of PPI Network evolution through whole genome duplication. A. Phase diagram for the limit degree distribution 
(see text). B&C. Comparison with protein direct physical interaction data for Yeast from BIND I33II and MIPS I34II databases: BIND (August 1 1, 2005 release), 
4576 proteins, 9133 physical interactions, k = 3.99, k 2 = 106 (filled symbols) and MIPS (downloaded online April 20, 2006), 4153 proteins, 7417 physical 
interactions, k = 3.57, k 2 = 78.6 (open symbols). Squares correspond to raw data, while circles and triangles are statistically averaged with gaps in connectivity 
distribution for large A; > 20, due to the finite size of Yeast PPI network. B. One-parameter fit of connectivity distribution data p^ (corresponding to the "X" 
mark in A., see text). Numerical connectivity distribution averaged over 10,000 network realizations (central green line). Numerical averages plus or minus two 
standard deviations (±2(7) are also displayed to show the predicted dispersions (upper and lower green lines) [Raw data (squares) do not fit within the mean 
±2(7 curves for large k due to the finite size of Yeast PPI network]. The fitting parameter 7 = 0.26 corresponds to an effective growth rate of 1 + 27 = 1.52. 
C. One-parameter fit of average connectivity of first neighbor proteins g h I26ll (i.e. k.g k sums connectivities of first neighbors from proteins of connectivity k). 
Numerical predictions averaged over 10,000 network realizations (central blue line). Numerical averages plus or minus two standard deviations are also displayed 
(upper and lower blue lines). Same fitting parameter value as in B, 7 = 0.26. Note that g k is rescaled by k/k 2 (as kg^ = k 2 holds for each network realization); 
this rescales large g k fluctuations between network realizations, due to the divergence of k 2 for p k ~ k~ a ~ with 2 > a > for the one-parameter model. 



(r o + r n > 1 to ensure a growing network) and r o — r n corresponds 
to the divergence asymmetry between duplicated proteins. We 
now discuss the two main stationary regimes for pt in the case of 
r n < T (the case T n > r o is deduced by permutating indices): 

• Exponential, non-conservative regime. If both r o < 1 and r n < 1, 

rS + it < r n + r Q , for ail k > 2 (10) 

and the factor in front of d^p(l) in |9j is always strictly positive, 
which implies that all derivatives of the limit degree distribution are 
finite. Hence, in this case, the limit degree distribution decreases 
more rapidly than any power law (see explicit asymptotic develop- 
ment in[25]). Note that this "exponential" regime occurs when the 
links emerging from each node (Fig. 1) are more likely lost than 
duplicated at each round of global duplication (as Tj = 7 + 71 < 1 
is equivalent to SS\ > 77O. This implies that most nodes eventually 
disappear, and with them all traces of network evolution, after just 
a few rounds of global duplication. The network topology is not 
conserved, but instead continuously renewed from duplication of the 
(few) most connected nodes. 

• Scale-free, conservative regime. If T > 1 > F n , the factor in front 
of d*p(l) in {5J can become negative. However, since the generating 
function should have all its derivatives positive, a negative value for 
one of them means that it simply does not exist. In fact, for F n In T n ± 
r o In r o > (red line in Fig.2A and l25lo . there is an integer r > 1 
such that, 

rn + r;<r n + r <r^ +1 + r^ +1 . (ii) 

implying that all derivatives d^pil) are finite up to the rth order, 
while dx +1 p(l) is infinite. This justifies the following asymptotic 
expansion of p(x) in the vicinity of x — 1, 



for some appropriate r < a < r + 1. This anzats is then inserted in 
JD using ('yx+S)(^ u , x+5 n , ) = l-r n , (l- x)+77n,o(l-a;) 2 to 
obtain an equation on the coefficients A\,... A r . The term A a does 
not mix with previous terms and gives the following equation for a, 

n + r? = r n + r Q . (13) 

The limit degree distribution follows a power law in this case,:f 

p k <xk- a -\ (14) 

(see red and blue "exponent" lines in Fig. 2A for a+1 = 2, 3, 4, . . .) 

Note that scale-free degree distributions emerge under successive, 
global network duplications only if the "old" node copy has its links 
more likely duplicated than lost at each round of global duplication 
(as r o = 7 + 70 > 1 is equivalent to 770 > SS ). Thus, "old" 
nodes statistically keep on increasing their connectivity once they 
have emerged as "new" nodes by duplication. This implies that most 
nodes and their surrounding links are conserved throughout the evo- 
lution process, thereby ensuring that local topologies of previous net- 
works remain embedded in subsequent networks. 

In summary, whole genome duplication with asymmetric diver- 
gence of duplicated proteins leads to the emergence of two classes of 
PPI networks with finite asymptotic degree distributions : i) PPI net- 
works with an exponential degree distribution and without conserved 
topology and ii) PPI networks with a scale-free limit degree distri- 
bution and at least local topology conservation. All other evolution 
scenarios, which do not lead to finite asymptotic degree distributions, 
are unlikely to model biologically relevant cases; they correspond ei- 
ther to an exponential disappearance of the whole PPI network (i.e. 
if r n + r o < 1) or to an exponential shift of all proteins towards 
higher and higher connectivities (i.e. dense regime in Fig. 2A for 

r n r Q > i)|2|. 



p(x) = i_Ai(l-ar) + . . . + (-l) r A r (l-x) r ~A cl (l~x) a -. . . , 

(12) 
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Symmetric divergence of duplicates with link "complementation" 

Another model of interest is the so-called "duplication-mutation- 
complementation" model initially proposed in the context of pro- 
tein network evolution through successive local duplications 113. 12711 . 
This model can be easily adapted to the context of PPI network evo- 
lution through whole genome duplication, Fig. S 1 . After each global 
duplication step, the probability to keep an instance of each interac- 
tion is now distributed randomly over the four equivalent links with- 
out reference to particular protein duplicates, unlike in the previous 
model. The complementation step (which ensures that at least one in- 
stance of each previous link is retained) can be enforced here through 
the "old" link copy (7 = 1) with 7 n corresponding to the "new" in- 
teraction sharing no node with 7 , while 7 still pertains to the last two 
equivalent cross links. This model is thus effectively symmetric from 
the protein point of view and readily yields the following recurrence 
for the generating function of the network degree distribution. 

F {n+1) {x) = 2F (n) (( 1X +S)(>y c x+S c )), (15) 

where 7 C = (7n + 7o)/2 and S c = (S n + S a )/2 are effective aver- 
age probabilities to retain or delete old and new links (see supporting 
information for proof details). Hence, the model of PPI network evo- 
lution with link complementation is in fact equivalent to the case of 
a symmetric divergence of duplicated proteins in the previous gen- 
eral model. Such symmetric divergence of duplicated proteins yields 
either a stationary exponential regime (T n + T a < 2, Fig.2A) or a 
non-stationary dense regime l25ll (T n + r o > 2, Fig.2A). 

Hence, the "duplication-mutation-complementation" model can- 
not lead to scale-free degree distributions, and thus to locally con- 
served network topology, in the context of whole genome duplication 
evolution, by contrast to the same model applied to local duplication 
with time-linear evolution fT^,l27ll . 

Fitting PPI network data with a one-parameter model 

Scale-free degree distributions have been widely reported for large 
biological networks and other exponentially growing networks like 
the WWW. We showed in the previous discussion that scale-free limit 
degree distributions require an asymmetric divergence of duplicated 
proteins (r o — T n = 70 — 7 n > 0) which corresponds to the prob- 
ability difference between conservation of old interactions (70) and 
coevolution of new binding sites (7 n ). The expected range of param- 
eters for actual biological networks is 1 ~ 70 > 7 > 7^ ~ 0; 
In particular, the most conservative (70 = 1) and least correlated 
(7n = 0) evolution scenario corresponds to the strongest divergence 
asymmetry between duplicated proteins (F — T n = 1, upper border 
on Fig.2A). The condition 70 = 1 ensures that not only local but 
also global topologies of all previous networks remain embedded in 
all subsequent networks. This model is effectively a one-parameter 
model (7) for PPI network evolution through whole genome duplica- 
tion. It converges towards a stationary scale-free limit degree distri- 
bution p fe ~ AT"" 1 with 0<a<2for0<7< (V5 - l)/2 
and generates non-stationary dense networks for (\/5 — l)/2 < 
7 < l l25ll . We used this one-parameter model to fit both the degree 
distribution (Fig.2B) and the average connectivity of first neighbors 
(Fig.2C) for direct physical interaction data of 5. cerevisiae taken 
from two databases, BINDQ and hand curated MIPsO (with 
presumabky fewer nonspecific spurious interactions (3^1). The pre- 
dicted asymptotic regime is in fact approached for k < 20 due to 
the finite size of Yeast PPI network. The fitting parameter 7 = 0.26 
corresponds to a fixed growth rate of 1 + 27 = 1.52 (i.e. the 
number of links and nodes increases by 52% at each global duplica- 
tion). Adding and removing up to 30% of links randomly, or draw- 
ing 7 from a uniform distribution between and 0.52 (with average 
7 = 0.26) yield remarkably similar fits (not shown) to the experi- 
mental data. This reveals a large insensibility to false- positive and 
negative noises and fluctuations in 7 (as long as the non- stationary 



dense regime is avoided, Fig.2A). The fixed (or averaged) growth 
rate of 52% at each round of global duplication is enough to generate 
networks of the size of 5. cerevisiae starting from a few interacting 
"seeds" after about 20 global duplications (i.e. 1.52 20 = 4334 times 
more nodes with an average of one global duplication per 200MY for 
4BY). Such scenario is not a priori incompatible with experimental 
data, as we only have clear records on global duplications dating back 
up to 400-500MY ago (i.e. only 10 to 20% of life history). Yet, these 
records suggest that "recent" whole genome duplications might be 
more frequent (every 100-150MY) and more selective (growth rates 
between 10 and 25%).§ 

Direct vs indirect protein-protein interactions 

The protein-protein interactions we have considered so far cor- 
respond to direct physical contact between protein pairs derived, 
for instance, from two-hybrid expression assavs l3^1 . However, we 
expect from the proposed scale-free fit of the degree distribution 
(Fig. 2B) that the underlying PPI network has conserved not only 
pairwise interactions during evolution but also some level of network 
topology (see above). The emergence of locally conserved topol- 
ogy in PPI network evolution leads "naturally" to conserved asso- 
ciations or "modules" between multiple proteins |j7l I3H, 13^. I4CL PHl 
and, beyond, to recurrent "motifs" across different types of biological 
networks jilEHiilillillilliilii. 

In fact, many biological functions are known to rely on multi- 
ple direct and indirect interactions within protein complexes. More- 
over, the combinatorial complexity of multiple-protein interactions 
is likely responsible for the remarkable diversity amongst living 
organisms I50I1 . despite their rather limited and largely shared genetic 
background (i.e. a few (ten) thousands genes built from a few hun- 
dreds families of homologous protein domains ll3lll4ll5lll52Tl "). 

High-throughput studies using affinity precipitation methods cou- 
pled to mass spectroscopy|53l l54ll55ll have proposed some 80,000 di- 
rect and indirect protein interactions for S. cerevisiae (raw data) and 
similar data are now becoming available for several other species. 

Yet, from a theoretical point of view, the evolution of indirect in- 
teractions is expected to depend not only on locally conserved net- 
work topology but also on the actual "combinatorial logic" between 
direct interactions. This cannot be readily defined on traditional PPI 
network representation (e.g. Fig. 1) and requires a somewhat more 
elaborate model as we now discuss. 

Redefining PPI network evolution in terms of protein domains 

Indirect protein interactions reflect the occurence of simultane- 
ous direct interactions within protein complexes. This requires that 
some proteins have more than one binding sites to simultenaously 
interact with several protein partners. Indeed, proteins with a sin- 
gle protein-binding site can only bind to one partners at a time, 
underlying a simple "XOR"-like combinatorial logic. By contrast, 
proteins with several protein-binding sites (which are usually multi- 
domain proteins) greatly increase the combinatorial complexity of bi- 
ological processes (like gene regulation or cell signaling) by adding 
"AND" operators to the computational logic between multiple di- 
rect interactions. Multi-domain proteins also provide a versatile sup- 
port for prot ein evolution through accretion or deletion of individual 
domains(TIl[Tll[T2[l!0. 

In addition, we note that binding sites l56H57ll on specific protein 
domains are likely the primary source of asymmetric divergence in 
PPI network evolution, as binding site mutations necessarily affect 
interactions with all binding partners (Fig. 1) and not just a random 
subset of them (Fig. SI). Hence, asymmetric divergence of pro- 
tein duplicates "naturally" originates from "spontaneous symmetry 
breaking" of their equivalent protein-binding sites (or domains). 

We propose to highlight this central role of protein domains in the 
evolution of PPI networks by simply redefining our initial asymmet- 
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FIG. 3 : Combining whole genome duplication and domain shuffling of multi-domain proteins. 

A. Protein-domain interaction network. Nodes now correspond to single binding domains in a protein-domain interaction network (solid lines). Multi-binding- 
domain proteins are introduced through a new type of links corresponding to covalent peptide bonds between protein domains (black dashed lines). This provides 
a graphical framework to distinguish mutually exclusive, direct interactions ("XOR") between protein domains from cummulative, indirect interactions ("AND") 
within multi-protein complexes (red dashed lines). B&C. Comparison with protein direct & indirect interaction data for Yeast from BIND|33] database (B&C 
filled symbols, indirect interactions from[53, 54]) and Ref|55] (C open symbols, see supporting information). Data are statistically averaged as in Fig. 2B&C 
to account for gaps in connectivities for large k > 20, due to the finite size of Yeast PPI network. B. Two-parameter fit of both direct connectivity distribution 
Pfc and average direct connectivity of first neighbor proteins 0i. l2dl (see Fig.2C and text). Numerical predictions are averaged over 1,000 network realizations 
(central green and blue lines). Numerical averages plus or minus two standard deviations are also displayed to show the predicted dispersions (upper and lower 
green and blue lines). The two adjusted parameters (7 = O.f and A = 0.3) correspond to a network growth rate of 20% and an average of 1.5 protein-binding 
sites (domains) per protein. The connectivity distribution of the underlying single-domain network (corresponding to 7 = 0.1 and A = 0.0) is also displayed 
(brown line) to illustrate its relation to the full multi-domain protein network (see text). C. Two-parameter fit of both direct & indirect "matrix" connectivity 
distribution and average direct & indirect "matrix" connectivity of first neighbor proteins Pt. l26ll (see text). Same two adjusted parameters (7 = 0.1 and 
A = 0.3) as in B while a selection of indirect interactions is added up to a total of 28,000 direct & indirect interactions (see supporting information). 



ric divergence model (Fig. 1) in terms of protein-binding domains 
(i.e. with a single protein-binding site) as illustrated in Fig. 3A. 
This alternative representation of PPI networks provides a theoreti- 
cal framework to model the evolution of the combinatorial logic un- 
derlying PPI networks, as it distinguishes mutually exclusive, direct 
interactions ("XOR") between protein domains (Fig.3A, black solid 
lines) from cummulative, indirect interactions ("AND") within multi- 
protein complexes (Fig.3A, red dashed lines). 

Combining whole genome duplication and domain shuffling. 

As noted in the introduction, whole-genome duplications promote 
efficient shuffling of multi-domain proteins by enabling many accre- 
tion and deletion events of functional domains after each genome 
doubling. We will assume in the following that this shuffling of multi- 
domain proteins is so efficient that protein domains encoded along 
the genome evolve independently from their inclusion in single- or 
multi-domain proteins (indeed, different multi-domain combinations 
are typically observed across living kingdoms fl4ll ). Besides, a more 
elaborate model of protein evolution detailing domain accretion and 
deletion events leads to virtually identical results for the large scale 
topological features of PPI network (not shown). The asymptotic 
generating function p(x) for multi-domain protein networks with in- 
dependent domain evolution can be deduced a posteriori as, 



p(x) = (1 - X)p(x) (1 + Xp(x) + A 2 /(i) + ...) = 



X)p(x) 



\p(x) 



where A is the probability of covalent connection between succes- 
sive protein domains encoded along the genome. This leads to an 
exponential distribution of multi-domain proteins, in agreement with 
actual distributions! 58, 59H . with an average of 1/(1 — A) protein- 
binding sites per protein. While p(x) now reflects the independent 
evolution of single protein-binding domains according to Eqs. 181121 . 
it also controls the asymptotic properties of the derived multi-domain 



networks p(x); in particular, for r o > 1 > F n , we obtain from 
Eq. J12t the following asymptotic expansion in the vicinity of x — 1, 



p(x) = 1 



1 -pQ) 

1 - Xp(x) 



1 - ... - 



1 - A 



{i- x y 



which implies that degree distributions of multi-domain protein net- 
works pk increase with respect to the underlying single-domain inter- 
action network pk as pk ~ P*/(l — A) for large k, while the fraction 
of proteins with a single binding partner p\ decreases at the same time 
as pi = p(0) = (1 - A)p'(0) = (1 - A)pi (see Fig. 3B). Note that 
the scale-free degree distribution of such multi-domain protein net- 
works results from an asymmetric divergence of individual binding 
sites (or domains) rather than asymmetric divergence of global pro- 
tein architectures. This has also consequences for the functionaliza- 
tion of duplicated genes (see supporting information). In particular, 
random (symmetric) "subfunctionalization" between protein dupli- 
cates at the level of protein domains does not prevent the emergence 
of scale-free networks with locally conserved topology, by contrast 
to random link "complementation" at the level of individual inter- 
actions (Fig. SI) which leads to exponential networks without con- 
served topology (as discussed above). 

Hence, domain shuffling of multi-domain proteins provides a pow- 
erful, yet non-disruptive source of combinatorial innovation, as it 
preserves essential topological features inherited from the underly- 
ing protein-domain interaction network evolution. 

Finally, comparison with experimental data sets including indirect 
protein-protein interactions l53l 154 15511 is made by adopting a sta- 
tistical implementation of the "combinatorial logic" discussed above 
(see supporting information). It is based on a Dijkstra algorithm that 
estimates the relative importance of all possible indirect interactions 
between multi-domain (and single-domain) proteins for each PPI net- 
work realization. Figs. 3B&C show rather good fits of experimental 
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data sets corresponding to an estimated 30% to 60% coverage of ac- 
tual PPI networks| 53, 54, 55] (see, however, supporting information). 
The two adjusted parameters, 7 = 0.1 and A = 0.3, correspond to a 
network growth rate of 20% (i.e. 1 + 27) and an average of 1.5 (i.e. 
1/(1 — A)) protein-binding sites (domains) per protein in agreement 
with broad estimates for these biological parameters (see above § and 
15^.15^1 ). This also confirms that the properties of PPI networks we 
have predicted from first principles (i.e. i) exponential dynamics and 
ii) symmetry breaking) are already transparent from partial data sets. 

Discussion 

Beyond whole genome duplications, local genome rearrangements 
such as small segmental duplications, rearrangements and horizontal 
transfers might well have been critical for the emergence and pro- 
liferation of living organisms. Moreover, we note that local dupli- 
cations/deletions may also lead to exponential dynamics of PPI net- 
work evolution if they are selected independently in parallel (expo- 
nential models of local or partial genome duplication are presented 
in ref l25ll ). Yet, recent records (<500MY) from various eukary- 
ote kingdoms (from protists to animals and plants) suggest that the 
majority of duplicates may still have arised from successive whole 
genome duplications (although this will need to be confirmed as more 
fully sequenced eukaryote genomes will become available). 

One possible origin for this less efficient selection of local du- 
plications might be the dosage imbalance they initially induce, 
thereby raising the odds for their rapid nonfunctionalizationloCl. 
loll 16211 (unless proved beneficial under concomitant environmen- 



tal changes l2^1 ). By contrast, rapid nonfunctionalization of dupli- 
cates following a whole genome duplication should be opposed by 
dosage effect. This is because whole genome duplications initially 
preserve correct relative dosage between expressed genes, while sub- 
sequent random nonfunctionalizations disrupt this initial dosage bal- 
ance. Preventing rapid asymmetric divergence between duplicates 
from recent whole genome duplications appears, in the end, to in- 
crease their chance of neo- or subfunctionalization by favoring longer 
(symmetric) genetic drift rather than early (asymmetric) functional 
loss. 

Conclusion 

Large scale topological features of PPI networks emerge "sponta- 
neously" in the course of evolution under simple duplication/deletion 
events 1231 . regardless of the specific evolutionary advantages indi- 
vidual proteins might have been selected for. Yet, the intrinsic ex- 
ponential dynamics of PPI network evolution by whole genome du- 
plications (or independent local duplications selected in parallel 1 25]) 
requires an asymmetric divergence of protein duplicates. Such asym- 
metric divergence arises "naturally" at the level of protein-binding 
sites or domains (through "spontaneous symmetry breaking") and is 
robust to extensive domain shuffling of multi-domain proteins. 
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I. Supplementary Figure. 
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FIG. SI: Alternative Model of PPI network evolution through whole genome duplication with symmetric divergence of duplicated proteins and random link 
"complementation " Il9l . l27ll . 

II. Proof of Recurrence Relations for Generating Functions (Eq.2 and Eq.15). 

After each whole genome duplication, each node has at most doubled its number of neighbors counted through powers of x in the generating 
function. Hence, a given PPI network realization with Nk nodes of connectivity k (k > 0) will contribute to the next duplicated ensemble of 
PPI networks as, 

N k x k -» N k x 2k (16) 
After link deletion with probability S or Si = S , 5 n , it contributes to the x m terms of the generating function (with m = 0, . . . , 2k) as, 



N k x 2k -> N k 



E(5)(t*)V-m (E({)(^H =^(W5)( 7i *+^y 



(17) 



for the asymmetric divergence model (Fig. 1, Eq. 2) and as, 



j=o 



1 ( 3 



■2? V t-o 



h-j 



l nX k-j-i„ 
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-> Nk^-yx + SX^x + S^y (18) 
with 7e = (70 + 7n)/2 and <5 e = (<S + 8 n )/2 for the symmetric divergence model with fo'nfc "complementation " H^.l27ll (Fig. SI, Eq. 15). 



III. Gene functionalization patterns in different models of PPI 
network evolution through whole genome duplication. 

The initial model depicted on Fig. 1 with asymmetric divergence 
of duplicated proteins leads typically to "neofunctionalization" of 
"new" duplicates, while "old" duplicates retain most initial interac- 
tions (if not all for 7 = 1). 

By contrast, the alternative model depicted on Fig. SI with 
symmetric divergence of duplicated proteins and random link 
"complementation " flH.l27ll leads typically to random "subfunction- 
alization" between protein duplicates at the level of individual inter- 
actions. However, this eventually leads to exponential degree distri- 
butions with no topology conservation of the PPI network (see main 
text), whereas scale-free degree distributions with at least local topol- 
ogy conservation of the PPI network indeed emerge under the initial 
asymmetric model, Fig. 1. 

Yet, as discussed in the main text, the necessary asymmetric di- 
vergence of protein duplicates occurs "spontaneously" at the level 
of protein-binding sites rather than of the entire (multi-domain) pro- 
teins, as assumed in Fig. 1. This motivates the redefinition of the 
initial model in terms of protein-binding domains (Fig. 3A) to cap- 
ture the asymmetric divergence of protein duplicates at the level of 
protein-binding sites and allow, at the same time, for extensive do- 
main shuffling events of multidomain proteins (see main text). 

This more elaborate model of PPI network evolution by whole 
genome duplication and domain shuffling encompasses both "neo- 
functionalization" and "subfunctionalization" of gene duplicates at 
the level of protein domains, in agreement with the suggestion that 
gene/protein evolution should be analyzed in terms of domains rather 
than entire proteins flCX Ull \VA IISL Hljl . In addition, this combined 
model of PPI network evolution also provides a theoretical frame- 
work to describe the evolution of the "combinatorial logic" behind 
indirect interactions within multi-protein complexes (see Fig. 3A and 
main text). 

IV. Statistical weighting of indirect interactions from protein 
complexes. 

We use a statistical implemention of the "combinatorial logic" un- 
derlying indirect protein interactions. Indirect interactions between 
protein pairs are weighted by the product of binding site "availabil- 
ities" along the shortest weighted path of intermediate direct inter- 
actions connecting them. The "availability" a; of a binding site i is 
defined as the relative expression level (e^) with respect to its first 
neighbor binding partners j of connectivity dj, 



Oi = 



< 1 



(19) 



Where expression level ej can be distributed with specific statistics, 
such as randomly, uniformly or according to characteristic power 
laws, as reported experimentally 1601 loll loH loll lool . Yet, in prac- 
tice, we found that the predicted large scale topological features of 
PPI networks depend only weakly on the specific distribution of ex- 
pression levels (for reasonable distribution range). 

The statistical probability of an (intermediate) direct interaction 
between domains i and j is then proportional to atctj , which we use in 
a Dijkstra-like algorithm)^}] for additive distance minimization as- 
signing d°j = — ln(a;aj) > weights between interacting domains 
i and j. Because of the presence of both covalent peptide bonds and 



direct, noncovalent interactions between protein domains (Fig. 3A), 
indirect protein-protein interactions correspond to alternating paths 
of noncovalent and covalent interactions with no successive noncova- 
lent interactions which are forbidden by the shared binding site con- 
straint {i.e. a binding site can only interact with one binding partner 
at a time). We describe below an algorithm which performs a simul- 
taneous minimization for paths starting with a covalent bond (dj) 
and paths starting with a direct, noncovalent interaction (dij). (An 
additional variable for second node Vij on the path is also needed to 
avoid non-physical "covalent loops".) 
The initialization of distances between protein domains is: 



Max, v°j = j 



Sij = 0, d°j — Max 
Sij = d°j — Max 



for all pairs, and 

for direct, noncovalent interactions, 

for covalent bonds, 

otherwise. 



We then iterate until convergence (after N 2 x (longest path) opera- 
tions): 

d' i: j = mint d^, min (S ik + c k j)) 

c'ij = mini dj , min (S ik + mm(d kj , c kj ))) 

k£(i) a ,v kj7 ii 

v[j = {k € (i) c I v kj / i, min(<5 ifc + min(d k j , c kj ))} 

and remove eventually the minimum paths starting with a covalent 
bond (to avoid double counting of indirect interactions for multido- 
main proteins below): 



if dij > mm(dj , Cji) 



(20) 



Hence, the probabilities to observe a single indirect interactions 
within protein complexes is given by: 



Wij — f3 exp(-dij) 



otherwise, 



with the normalization condition 'Ei < jWij — 1, which gives 

1/(3 = E 4<i exp(-dij). 

Wij is thus the normalized product of availabilities a k along the 
shortest weighted path between i and j. 

Finally, the individual probabilities ptj to observe a total of M indi- 
rect interactions within protein complexes are given by: 



Pij = 1 - (1- mj) n 
where n is solution of T,i < jPij 



(21) 



M. 



Given the number M of indirect interactions in various data 
sets f53l l54L Is^l . we have assessed their expected contribution to the 
large scale topology of Yeast PPI network from the two-parameter 
7 — A model described in the main text. M ~ 28, 000 corresponds 
to the sum of about 9, 000 direct physical interactions from the 
BIND databaseHl (Fig. 2B&C filled symbols) and about 19, 000 
"matrix" interactions from between 2, 100 proteins already 

involved in direct physical interactions (out of 4, 576 proteins in the 
BIND database, Fig. 3C filled symbols). "Matrix" interactions from 



ref ill (Fig. 3C open symbols) are "reconstructed" from supple- 
mentary information files of [55] as follows: "matrix" interactions 
are included for (each complex core) x (each associated "module") 
and (each complex core) x (each associated "attachment" = one 
protein). This reconstructed dataset should therefore be considered 
as incomplete, since "matrix" interactions between compatible 
modules and/or attachments associated to a given core are not taken 
into account (information not given in l55ll ). 



Numerical fits (7 = 0.1, A = 0.3) are displayed on Fig. 3C (for 
direct and indirect interactions) for both connectivity distribution 
(green) and average connectivity of first neighbors (blue). They 
corresponds to the same adjusted values (7 = 0.1, A = 0.3) as in 
Fig. 3B (for direct interactions only). 
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* There is even a 10 5 -fold span in genome lengths when including 
prokaryotes and 10 8 -fold including viruses. 

f This condition can be shown l25ll to ensure that the evolution of the en- 
semble average of networks (EqQ indeed reflects the "typical" evolu- 
tion of PPI networks under global duplication. 

I When r„ + = r n + r o for exactly some integer r > 1 the last 
term in Eq |12l should be replaced by (1 — x) r ln(l — x), and the limit 
degree distribution decreases like k~ r ~~ 1 (i.e. red/blue lines in Fig.2A). 

§ Ciona (16,000 genes) and human (~25,000 genes) [resp. tetraodon 
(~22,000 genes)] differ by two [resp. three] whole genome duplica- 
tions; this corresponds to an averaged growth rate of 25% [resp. 11%]. 



